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Abstract 

The seemingly disjoint problems of count and mixture modeling are united under the negative 
binomial (NB) process. We reveal relationships between the Poisson, multinomial, gamma and Dirichlet 
distributions, and construct a Poisson-logarithmic bivariate count distribution that connects the NB and 
Chinese restaurant table distributions. Fundamental properties of the models are developed, and we 
derive efficient Bayesian inference. It is shown that with augmentation and normaUzation, the NB 
process and gamma-NB process can be reduced to the Dirichlet process and hierarchical Dirichlet 
process, respectively. These relationships highlight theoretical, structural and computational advantages 
of the NB process. A variety of NB processes including the beta-geometric, beta-NB, marked-beta- 
NB, marked-gamma-NB and zero-inflated-NB processes, with distinct sharing mechanisms, are also 
constructed. These models are applied to topic modeling, with connections made to existing algorithms 
under the Poisson factor analysis framework. Example results show the importance of inferring both 
the NB dispersion and probability parameters, which respectively govern the overdispersion level and 
variance-to-mean ratio for count modeling. 

Index Terms 

Beta process, Chinese restaurant process, completely random measures, clustering, count modeling, 
Dirichlet process, gamma process, hierarchical Dirichlet process, mixed membership modeling, mixture 
modeling, negative binomial process, Poisson factor analysis, Poisson process, topic modeling. 

I. Introduction 

Count data appear in many settings, such as modeling the number of insects in regions of 
interest Q, lEl, predicting the number of motor insurance claims [|2l and modeling topics 
of document corpora flU, JSl, H, Q. There has been increasing interest in count modeling 
using the Poisson process, geometric process (HI, BUl, [fTOl . [fTTI . [fT2ll and recently the negative 
binomial (NB) process [[3, [fT3l . [[T4|. Notably, we have shown in IQ and further demonstrated 
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in lfT4l that the NB process, originally constructed for count analysis, can be naturally applied 
for mixture modeling of grouped data a^i, ■ ■ ■ , xj, where each group Xj = {xji}i=i^Nj- 

Mixture modeling infers probability random measures to assign data points into clusters 
(mixture components), which is of particular interest to statistics and machine learning. Although 
the number of points assigned to clusters are counts, mixture modeling is not typically considered 
as a count-modeling problem. Clustering is often addressed under the Dirichlet-multinomial 
framework, using the Dirichlet process l|T5l , [fT6l , IITTll . IfTSl , [fT9l as the prior distribution. With 
the Dirichlet multinomial conjugacy, Dirichlet process mixture models enjoy tractability because 
the posterior of the probability measure is still a Dirichlet process. Despite its popularity, it 
is well-known that the Dirichlet process is inflexible in that a single concentration parameter 
controls the variability of the mass around the mean [fT9l , ll20ll : moreover, the inference of the 
concentration parameter is nontrivial, usually solved with the data augmentation method proposed 
in [[TTll . Using probability measures normalized from gamma processes, whose shape and scale 
parameters can both be adjusted, one may mitigate these disadvantages. However, it is not clear 
how the parameters of the normalized gamma process can still be inferred under the multinomial 
likelihood. For mixture modeling of grouped data, the hierarchical Dirichlet process (HDP) f2\\ 
has been further proposed to share statistical strength between groups. However, the inference 
of the HDP is a challenge and it is usually solved under alternative constructions, such as the 
Chinese restaurant franchise and stick-breaking representations [|2T]| . [l22l . [|23l . 

To construct more expressive mixture models, without losing the tractability of inference, 
in this paper we consider mixture modeling as a count-modeling problem. Directly modeling 
the counts assigned to clusters as NB random variables, we perform joint count and mixture 
modeling via the NB process, using completely random measures [|24l . [[8]|, [|25l . Il20ll that are 
simple to construct and amenable for posterior computation. We reveal relationships between 
the Poisson, multinomial, gamma, and Dirichlet distributions and their corresponding stochastic 
processes, and we connect the NB and Chinese restaurant table distributions under a Poisson- 
logarithmic bivariate count distribution. We develop data augmentation methods unique to the 
NB distribution and augment a NB process into both the gamma-Poisson and compound Poisson 
representations, yielding unification of count and mixture modeling, derivation of fundamental 
model properties, as well as efficient Bayesian inference using Gibbs sampling. 

Compared to the Dirichlet-multinomial framework, the proposed NB process framework pro- 
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vides new opportunities for better data fitting, efficient inference and flexible model constructions. 
We make four additional contributions: 1) we construct a NB process and a gamma- NB process, 
analyze their properties and show how they can be reduced to the Dirichlet process and the HDP, 
respectively, with augmentation and then normalization. We highlight their unique theoretical, 
structural and computational advantages relative to the Dirichlet-multinomial framework. 2) We 
show that a variety of NB processes can be constructed with distinct model properties, for 
which the shared random measure can be selected from completely random measures such as the 
gamma, beta, and beta-Bernoulli processes. 3) We show NB processes can be related to previously 
proposed discrete latent variable models under the Poisson factor analysis framework. 4) We 
apply NB processes to topic modeling, a typical example for mixture modeling of grouped data, 
and show the importance of inferring both the NB dispersion and probability parameters, which 
respectively govern the overdispersion level and the variance-to-mean ratio in count modeling. 

Parts of the work presented here first appeared in our papers Q, [|2l, [fT4ll . In this paper, we 
unify related materials scattered in these three papers and provide significant expansions. New 
materials include: we construct a Poisson-logarithmic bivariate count distribution that tightly 
connects the NB, Chinese restaurant table, Poisson and logarithmic distributions, extending the 
Chinese restaurant process to describe the case that both the number of customers and the 
number of tables are random variables. We show how to derive closed- form Gibbs sampling for 
hierarchical NB count models that can share statistical strength in multiple levels. We prove that 
under certain parameterizations, a Dirichlet process can be scaled with an independent gamma 
random variable to recover a Gamma process, which is further exploited to connect the NB and 
Dirichlet processes, and the gamma-NB process and HDP. We show that in Dirichlet process 
based models, the number of points assigned to a cluster is marginally beta-binomial distributed, 
distinct from the NB distribution used in NB process based models. In the experiments, we 
provide a comprehensive comparison of various NB process topic models and related algorithms, 
and make it clear that key to constructing a successful mixture model is appropriately modeling 
the distribution of counts, preferably to adjust two parameters for each count to achieve a balanced 
fit of both the mean and variance. 

We mention that beta-NB processes have been independently investigated for mixed member- 
ship modeling in [[T3l . with several notable distinctions: we study the properties and inference 
of the NB distribution in depth and emphasize the importance of learning the NB dispersion 
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parameter, whereas in [[T3l . the NB dispersion parameter is empirically set proportional to the 
group size Nj in a group dependent manner. We discuss a variety of NB processes, with beta-NB 
processes independently studied in and [[T3 ]| as special cases. We show the gamma-Poisson 
process can be marginalized as a NB process and normalized as a Dirichlet process, thus suitable 
for mixture modeling but restrictive for mixed membership modeling, as also confirmed by our 
experimental results; whereas in [|13L the gamma-Poisson process is treated parallel to the beta- 
NB process and considered suitable for mixed membership modeling. We treat the beta-NB 
process parallel to the gamma-NB process, which can be augmented and then normalized as an 
HDP; whereas in [fT3]|. the beta-NB process is considered less flexible than the HDP, motivating 
the construction of a hierarchical-beta-NB process. 

II. Preliminaries 

A. Completely Random Measures 

Following [20 1, for any z/+ > and any probability distribution n{dpduj) on the product space 
]R X 1], let ~ Pois(z/+) and {{pk,(^k)}i,K ~ TT{dpduj). Defining as being one if Uk & A 

and zero otherwise, the random measure C{A) = J2k=i '^A{^k)Pk assigns independent infinitely 
divisible random variables C{Ai) to disjoint Borel sets Ai C fi, with characteristic functions 

E[e^t^iA)] = exp |/ 4^^(e**P - l)u{dpduj)^ (1) 

with i/{dpdu) = iy^TT{dpdu). A random signed measure C satisfying ([T]) is called a Levy random 
measure. More generally, if the Levy measure ^{dpdu) satisfies 

//iRx5"^Wl, \p\}i^{dpduj) < oo (2) 

for each compact S C ^l, the Levy random measure C is well defined, even if the Poisson 
intensity z/+ is infinite. A nonnegative Levy random measure C satisfying ^ was called a com- 
pletely random measure in [|24l . ^ and an additive random measure in [|26l . It was introduced 
for machine learning in [TT] and [|25l . 

1) Poisson Process: Define a Poisson process X ~ PP(Go) on the product space Z+ x n, 
where Z_|_ = {0, 1, ■ • • }, with a finite continuous base measure Gq over such that X(A) ~ 
Pois(G'o(^)) for each subset A C ^l. The Levy measure of the Poisson process can be derived 
from ([T]) as u{duduj) = 5i{du)Go{duj), where 5i{du) is a unit point mass at n = L If Go 
is discrete (atomic) as Gq = J^k^k^uik^ then the Poisson process definition is still valid that 
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X = '^i^XkSuj,., Xk ~ Pois(Afc). If Go is mixed discrete-continuous, then X is the sum of two 
independent contributions. As the discrete part is convenient to model, without loss of generality, 
below we consider the base measure to be continuous and finite. 

2) Gamma Process: We define a gamma process (H, [|20l G ~ GaP(c, Gq) on the product 
space ]R_|_ x Vt, where IR+ = {x : x > 0}, with concentration parameter c and base measure 
Go, such that G{A) ~ Gamma(Go(A), 1/c) for each subset A C Vt, where Gamma(A;a,6) = 
r(a)b° -^" ''^6^^ ^nd r(-) denotes the gamma function. The gamma process is a completely random 
measure, whose Levy measure can be derived from ^ as 

pi^drdu) = r''^e~'^^drGo{dijj). (3) 

Since the Poisson intensity z/+ = z/(i?+ x fi) = oo and J j^^^^rv^drduj) is finite, there are 
countably infinite points and a draw from the gamma process can be expressed as 

^ = YlkLi^k^'^ky {rk,(^k) ~ nidrdu), TT{drdu)u^ = v{drduj). (4) 

3) Beta Process: The beta process was defined by [|28l for survival analysis with Vl = IR+. 
Thibaux and Jordan [|27l generalized the process to an arbitrary measurable space Vt by defining 
a completely random measure B on the product space [0, 1] x with Levy measure 

v^dpduj) = cp^^{l — pY^^dpBQ^du). (5) 

Here c > is a concentration parameter and Bq is a base measure over Since the Poisson 
intensity z/+ = z/([0, 1] x ^7) = oo and J J^^^^^^pu^dpduj) is finite, there are countably infinite 
points and a draw from the beta process B ~ BP(c, Bq) can be expressed as 

B = YlkLiPk^'^k^ {Pk,uJk) ~ iT{dpdu), n{dpdu)u^ = v{dpduj). (6) 

B. Dirichlet Process and Chinese Restaurant Process 

1) Dirichlet Process: Denote G = G/G{Vl), where G ~ GaP(c, Gq), then for any measurable 
disjoint partitionAi,-- - , of f], we have G(Ai),-- - ,G{Aq) ~ Dir (^7oGo(v4i), ■ ■ ■ ,7oGo(v4q) 
where 70 = Go(fi) and Gq = Go/70. Therefore, with a space invariant concentration parameter, 
the normalized gamma process G = G /G{Vl) is a Dirichlet process ifTSl . [|29ll with concentration 
parameter 70 and base probability measure Gq, expressed as G ~ DP(7o, Gq). Unlike the gamma 
process, the Dirichlet process is no longer a completely random measure as the random variables 
{G{Aq)} for disjoint sets {Aq} are negatively correlated. 
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2) Chinese Restaurant Process: In a Dirichlet process G ~ DP (70, Gq)^ we assume Xj ~ G; 
{Xi} are independent given G and hence exchangeable. The predictive distribution of a new 
data point X^+i, conditioning on Xi, ■ ■ ■ ,Xm, with G marginalized out, can be expressed as 



G 



"fc A -I- To C7^ 

fc=l m+'vn "^fe m+'vn '"^O ' ^ 



Xi, ■ ■ ■ , X, 

where {wfcji a' are discrete atoms in Vt observed in Xi, ■ ■ ■ ,X„j and Uk = J2^i^ii^k) is the 
number of data points associated with tUk- The stochastic process described in (|7]) is known as 
the Polya urn scheme [|30ll and also the Chinese restaurant process [1311 . ETTl . ||32ll . 

3) Chinese Restaurant Table Distribution: Under the Chinese restaurant process metaphor, the 
number of customers (data points) m is assumed to be known whereas the number of nonempty 
tables (distinct atoms) K is treated as a random variable dependent on m and 70. Denote s(m, I) 
as Stirling numbers of the first kind, it is shown in |fT6l that the random table count K has PMF 

Pr(ii' = /|m,7o) = f(Si)kK0l7^, / = 0,l,---,m. (8) 

We refer to this distribution as the Chinese restaurant table (CRT) distribution and denote / ~ 
CRT(m, 7o) as a CRT random variable. As shown in Appendix A, it can be sampled as / = 
Yl^=i ^n, bn ~ Bernoulli (7o/(?^ — 1 + 7o)) or by iteratively calculating out the PMF under the 
logarithmic scale. The PMF of the CRT distribution has been used to help infer the concentration 
parameter 70 in Dirichlet processes [fTTl . [[2T]|. Below we explicitly relate the CRT and NB 
distributions under a Poisson-logarithmic bivariate count distribution. 

in. Inference for the Negative Binomial Distribution 
The Poisson distribution m ~ Pois(A) is commonly used to model count data. It has probability 
mass function (PMF) /a/(?ti) = rn E Z+, with both the mean and variance equal to A. 

Due to heterogeneity (difference between individuals) and contagion (dependence between the 
occurrence of events), count data are usually overdispersed in that the variance is greater than 
the mean, making the Poisson assumption restrictive. By placing a gamma distribution prior with 
shape r and scale p/(l — p) on A as m ~ Pois(A), A ~ Gamma ^r, and marginalizing out 
A, a negative binomial (NB) distribution m ~ NB(r, p) is obtained, with PMF 

/M(m) = /,°° Pois(m; A)Gamma (A; r, ^) dX = - pfp"^ (9) 

where r is the nonnegative dispersion parameter and p is the probability parameter. Thus the 
NB distribution is also known as the gamma- Poisson mixture distribution [|33l . It has a mean 
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H = rp/{l — p) smaller than the variance cr^ = rp/{l — pY = /i + r^^jj?, with the variance- 
to-mean ratio (VMR) as (1 — p)^^ and the overdispersion level (ODL, the coefficient of the 
quadratic term in cr^) as r^^, and thus it is usually favored over the Poisson distribution for 
modeling overdispersed counts. As shown in [|34ll . m ~ NB(r, can also be generated from a 
compound Poisson distribution as 

m = Y!t=iUt, M/~Log(p), Z ~ Pois(-rln(l -p)) (10) 

where u ~ Log(p) corresponds to the logarithmic distribution [|35l . [|36l with PMF fu{k) = 
—p''/[kln{l — p)], A; G {1, 2, . . . }, and probability-generating function (PGF) 

Cu{z) = \n{l-pz)/\n{l-p), \z\ < p-\ (11) 

In a slight abuse of notation, but for added conciseness, in the following discussion we use 
m ~ Z]LiLog(p) to denote m = J2[=i Ut, Ut ~ Log(p). 

The NB distribution has been widely investigated and applied to numerous scientific studies 
Q, ll37l . [|38ll . [|39ll . Although inference of the NB probability parameter p is straightforward, 
as the beta distribution is its conjugate prior, inference of the NB dispersion parameter r, whose 
conjugate prior is unknown, has long been a challenge. The maximum likelihood (ML) approach 
is commonly used to estimate r, however, it only provides a point estimate and does not allow 
the incorporation of prior information; moreover, the ML estimator of r often lacks robustness 
and may be severely biased or even fail to converge, especially if the sample size is small [i40l , 
im, [l42l . P3ll . [l44l . P31 . Bayesian approaches are able to model the uncertainty of estimation 
and incorporate prior information, however, the only available closed-form Bayesian inference 
for r relies on approximating the ratio of two gamma functions [|461 . 

Lemma III.l. Augment m ~ NB(r, p) under the compound Poisson representation as m 
^l^j^Log(p), I ~ Pois(— rln(l — p)), then the conditional posterior of I given m and r can be 
expressed as /|m, r ~ CRT(m, r). 

Proof: Let m ~ SumLog(Z,p) be the sum-logarithmic distribution as m ~ Xll=iLog(p). 
Since m is the summation of / iid Log(p) random variables, its PGF becomes Cm{z) = C\j{z) = 
[ln(l — pz)/\n{l — p)f , \z\ < p~^. Using the properties that [ln(l + x)]' = /! Yl'^=i ^i^^ l)x'^/n\ 
and \s{mj)\ = (— l)™'"^s(m, /) |35|, we have the PMF of m ~ SumLog(/,p) as 

JM[m\L,p) - ^, - m\[-ln{l-p)y 
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Let (m, /) ~ PoisLog(r,p) be the Poisson-logarithmic bivariate count distribution that describes 
the joint distribution of counts m and / as m ~ X]t=iLog(p), / ~ Pois(— rln(l — p))- Since 
fM,L{m.,l\r,p) = fM{m\l,p)fL{l\r,p), we have the PMF of (m, /) ~ PoisLog(r,p) as 

, „^ _ p'mm,l)\ (-rln(l-p))'e'- _ |s(m,;)|r' 

Since fM,L{m,l\r,p) = f L{l\m,r) f M{m\r,p), we have 

f,f/|^ r\ — fM,L{m,l\r;p) _ |s(m,0|r'(l-p)'-p"' _ Fjr) i / ,n| ; 
J Lyi'l'll', I ) fM{m\r,p) ■m!NB(m;r,p) r(m+r) I v""' V I ' 

which is exactly the same as the PMF of the CRT distribution shown in ([8]). ■ 

Corollary III.2. The Poisson-logarithmic bivariate count distribution with PMF fM,L{'^, ^f^p) = 
\s{m^)\r _ pYpm expressed as the product of a CRT and a NB distributions and also 

the product of a sum-logarithmic and a Poisson distributions as 

PoisLog(m, /; r,p) = CRT(Z; m, r)NB(m; r,p) = SumLog(m; /,p)Pois(/; — r ln(l — p)). (14) 

Under the Chinese restaurant process metaphor, the CRT distribution describes the random 



number of tables occupied by a given number of customers. Using Corollary III.2, we may 
have the metaphor that the Poisson-Logarithmic bivariate count distribution describes the joint 
distribution of count random variables m and / under two equivalent circumstances: 1) there are 



/ ~ Pois(— rln(l — p)) tables with m ~ X]t=iLog(p) customers; 2) there are m ~ NB(r, 



customers seated on / ~ CRT(m, r) tables. Note that according to Corollary A. 2 in Appendix A, 
in a Chinese restaurant with concentration parameter r, around r In tables would be required 
to accommodate m customers. 

Lemma IIL3. Let m ~ NB(r, p), r ~ Gamma(ri, 1/ci) represent the gamma-NB distribution, 
denote p' = ~^^^(^i^ly then m can also be generated from a compound distribution as 

m ~ ELiLog(p), / ~ Ef'=iLog(p'), ~ Pois(-riln(l -p')) (15) 
which is equivalent in distribution to 

^ ~ EULog(p), /' ~ CRT(/,ri), / ~ NB(ri,p')- (16) 
Proof We can augment the gamma-NB distribution as m ~ Yl\=i Log(p)) ^ ~ Pois(— r ln(l — 
p)), r ~ Gamma(ri, 1/ci). Marginalizing out r leads to m ~ Yl\=i^'^^ip) ^ ^ ~ NB(ri,p'). 
Augmenting / using its compound Poisson representation leads to ( [T5| ). Using Corollary IIL2[ 



we have that (15) and (16) are equivalent in distribution. 
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Using Corollary |IIL2[ it is evident that to infer the NB dispersion parameter, we can place 
a gamma prior on it as r ~ Gamma(ri, 1/ci); with the latent count / ~ CRT(m, r) and the 
gamma-Poisson conjugacy, we can update r with a gamma posterior. We may further let ri ~ 



Gamma(r2, 1/C2); using Lemma IIL3 it is evident that with the latent count /' ~ CRT(/,ri), 
we can also update ri with a gamma posterior. Using Corollary [III. 2 and Lemma in.3[ we can 
continue this process repeatedly, suggesting that for data that have subgroups within groups, 
we may build a hierarchical model to share statistical strength in multiple levels, with tractable 
inference. To be more specific, assuming we have counts {mji, ■ ■ ■ ,mjNj}j=i,j from J data 
groups; to model their distribution, we construct a hierarchical model as 



rrii. 



NB{rj,pj), pj ~ Beta(ao,&o); ''"j ~ Gamma(ri, 1/ci), ri ~ Gamma(r2, 1/C2). 



Then Gibbs sampling proceeds as 

(PjH ~ Beta (^ao + E£i rriji, &o + Njr^^ , p'^ = '^^^^^^'^^^j,^ 
~ CRTK,,r,), (/;.|-) ~ CRT (Eflih^r,) 

n ~ Gamma (^rs + J^U ^'r c,-E/^!in(i-p^.) ) ' ^ G^^^^ (^1 + E£i hi^ c,-N,l{i-p, ) ^ 
The conditional posterior of the latent count / was first derived by us in [|2l and its analytical 
form as the CRT distribution was first discovered by us in |fT4l. Here we provide a more compre- 
hensive study to reveal connections between various discrete distributions. These connections are 
key ingredients of this paper, which not only allow us to unite count and mixture modeling and 
derive efficient inference, but also, as shown in Sections IV and|V| let us examine the posteriors to 
understand fundamental properties of the NB processes, clearly revealing connections to previous 
nonparametric Bayesian mixture models. 

IV. Negative Binomial Process Joint Count and Mixture Modeling 
A. Poisson Process for Joint Count and Mixture Modeling 

The Poisson distribution is commonly used for count modeling [|47l and the multinomial 
distribution is usually considered for mixture modeling, and their conjugate priors are the gamma 
and Dirichlet distributions, respectively. To unite count modeling and mixture modeling, and to 
derive efficient inference, we show below the relationships between the Poisson and multinomial 
random variables and the gamma and Dirichlet random variables. 
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Lemma IV.l. Suppose that xi, . . . ,xk are independent Poisson random variables with 

Xk ~ Pois(Afc), X = Ylk=i ^k- 

Set X = J2k=i -^fc'' (z/) Z/i) • • • ) Vk) be random variables such that 

y ~ Pois(A), (yi, ...,yk)\y^ Mult (y;^,...,^). 

Then the distribution ofx = (x, Xi, . . . , xk) is the same as the distribution ofy = {y,yi, . . . , yx) fTl . 

Corollary IV.2. Let X ~ PP(G') be a Poisson process defined on a completely random mea- 
sure G such that X{A) ~ Pois(G'(y4)) for each A d Vt. Define Y ~ MP(F(i7), as 
a multinomial process, with total count YiVt) and base probability measure -(j^, such that 
■ ■ ■ , F(Aq)) ~ Mult (Vin); f^, • • • , ^) for any disjoint partition o/fi; 
let Y{Q) ~ Pois(G'(r2)). Since X{A) and Y{A) have the same Poisson distribution for each 
A G Q, X and Y are equivalent in distribution. 

Lemma IV.3. Suppose that random variables y and {yi, . . . ,yK) are independent with y ~ 
Gamma(7, 1/c), {yi, . . . ^yx) ~ Dir(7Pi, ■ ■ ■ , IPk), where Y.k=iPk = 1- Let Xk = yyk, then 
{xk}i,K are independent gamma random variables with Xk ~ Gamma(7PA,., 1/c). 



Proo/:- Since = ?/(l-^f^/?/fc) and " g};'^','^^^^^ we have /xi,.. ■ ■ ■ > 2;^^) 

fYu-,YK-i{yi,--- ,yK~i)fY{y)y'^'^^ = nf=iGamma(xfc;7Pfc, l/c). ■ 

Corollary IV.4. If the gamma random variable a and the Dirichlet process G are independent 
with a ~ Gamma(7o, 1/c), G ~ DP(7o, Go), where 70 = Go{Q) and Go = Go/70, then the 
product G = aG is a gamma process with G ~ GaP(c, Gq). 

Using Corollary |IV.2 . we illustrate how the seemingly distinct problems of count and mixture 



djyi,- ,yK-i,y) 



modeling can be united under the Poisson process. Denote as a measure space and for each 
Borel set A C ^l, denote Xj(A) as a count random variable describing the number of observations 
in Xj that reside within A. Given grouped data Xi, - ■ ■ , Xj, for any measurable disjoint partition 
Ai,--- , Aq of ^l, we aim to jointly model the count random variables {Xj(Ag)}. A natural 
choice would be to define a Poisson process Xj ~ PP(G), with a shared completely random 
measure G on such that Xj{A) ~ Pois(G(A)) for each A C and G{n) = E?=iG'(^g)- 



Following Corollary IV.2, letting Xj ~ PP(G) is equivalent to letting 

Xj ~ MF{Xj{n),G), Xj{n) ~ Pois(G(r])) (17) 
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where G = G /G{VL). Thus the Poisson process provides not only a way to generate independent 
counts from each Aq, but also a mechanism for mixture modeling, which allocates the Xj(Vt) 
observations into any measurable disjoint partition {Ag}i,Q of Vt, conditioning on the normalized 
mean measure G. A distinction is that in most clustering models the number of observations 
Xj{Q) is assumed given and Xj{A) ~ Binomial(Xj(i7), G{A)), whereas here XjiVt) is modeled 
as a Poisson random variable and Xj{A) ~ Poisson(G'(yl)). 

B. Gamma-Poisson Process and Dirichlet Process 

To complete the Poisson process, we may place a gamma process prior on G as 

X, ~PP(G), j = l,--- ,J, G'~GaP(J(l-p)/p,G'o). (18) 

Here the base measures of the Poisson process (PP) and gamma process (GaP) are not restricted 
to be continuous. Marginalizing out G of the gamma-Poisson process leads to a NB process 
X = E/=i^i ~ NBP(Go,p), in which X{A) ~ NB(Go(A),j9) for each A C Q. We comment 
here that when J > 1, i.e., there is more than one data group, one need avoid the mistake 
of marginalizing out G in Xj ~ PP(G'), G ~ GaP(c, Go) as X^ ~ NBP(Go,l/(c+ 1)). The 
gamma-Poisson process has also been discussed in H, IfTOll , [fTTTl . [[T2ll for count modeling. 
Here we show that it can be represented as a NB process, leading to fully tractable closed-form 
Bayesian inference, and we demonstrate that it can be natrually applied for mixture modeling. 

Define L ~ CRTP(X, Go) as a CRT process that for each A C n, L{A) = Y^coeA ^(^)' ^(^) ~ 
CRT(X(a;), G'o(w)). Under the Chinese restaurant process metaphor, X(A) and L{A) represent 



the customer count and table count, respectively, observed in each A C ^l. Using Corollary III. 2 



their joint distribution is the same for: 1) first drawing L{A) ~ Pois(— G'o(A) ln(l — p)) tables 
and then assigning Log(p) customers to each table, with X(A) total number of customers; 
2) first drawing X(A) ~ 'NB(Go{A),p) customers and then assigning them into L(A) ~ 
CRT(X(a;), Go('^)) tables. Therefore, the NB process augmented under the compound 
Poisson representation as X ~ J2t=i Log(p), L ~ PP(— Gq p)) is equivalent in distribution 
to L ~ CRTP(X, Go), X ~ NBP(Go,p). These equivelent reprsentations allow us to derive 
closed-form Bayesian inference for the NB process. 

If we impose a gamma prior Gamma(eo, l//o) on 70 = Go{Vl) and a beta prior Beta(ao, I/&0) 
on p, using the conjugacy between the gamma and Poisson and the beta and NB distributions, 
we have the conditional posteriors as 
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G\X,p, Go ~ GaP(J/j9, Go + X), {p\X, G) ~ Beta(ao + X(^]), 60 + 7o) 

L\X,Go ~ CRTP(X,G'o), ilo\L,p) ~ Gamma (eo + • (19) 

If the base measure Go is continuous, then Go(a;) — )■ and we have L{uj) ~ CRT(X(t<j), Go(w)) = 
5(X(u;) > 0) and thus = J^uen^i^i^) ^ '•'^^ number of tables is equal to K^, 

the number of observed discrete atoms. The gamma-Poisson process is also well defined with a 
discrete base measure as Go = J2k=i 'x^'^k^ which becomes continuous only if K — > 00. With 
such a base measure, we have L = Yl!k=i^k5'^k-> ~ CRI{X{ujk),'jo/ K); it becomes possible 
that Ik > 1 if X(uJk) > 1, which means L{^1) > . Thus when Go is discrete, using the 
number of observed atoms instead of the the number of tables L(yt) to update the mass 
parameter 70 may lead to a biased estimate, especially if K is not sufficiently large. 



Based on Corollaries IV.2 and IV.4 the gamma-Poisson process is equivalent to 



Xj ~ MP(Xj(f]),G), G ~ DP(7o,Go), Xj(fi) ~ Pois(a), a ~ Gamma(7o,p/(J(l-p))) (20) 

where G = aG and Go = 7oGo. Thus without modeling XjiVt) and G{Vt) = a as random 
variables, the gamma-Poisson process becomes the Dirichlet process, which is widely used for 
mixture modeling [|T5ll , [fTTll . [|29l , [fT9ll . Note that for the Dirichlet process, no analytical forms 
are available for the conditional posterior of 70 when Go is continuous [fTTll and no rigorous 
inference for 70 is available when Go is discrete. Whereas for the proposed gamma-Poisson 



process augmented from the NB process, as shown in ( 19), the conditional posteriors are analytic 



regardless of whether the base measure Go is continuous or discrete. 

C. Block Gibbs Sampling for the Negative Binomial Process 

For a finite continuous base measure, a draw from the gamma process G ~ GaP(c, Gq) 
can be expressed as an infinite sum as in Q. Here we consider a discrete base measure as 
= Ef=i^'^<^fc' then we have G = Ylk=i^k^^k^ ^k ~ Gamma (70 /K, 1/c), Uk ~ fi-ol^fc), 
which becomes a draw from the gamma process with a continuous base measure as — )■ 00. 
Let Xji ~ F(w2j.) be observation i in group j, linked to a mixture component w^^- G Vt through 
a distribution F. Denote rij^ = ^fdi = k), we can express the NB process as 

Xji ~ F{uz^J, Uk ~ go{(^k), Nj = X;f=i rijk, Ujk ~ Pois(rfc) 
Tfc ~ Gamma(7o/ir,p/(J(l -p))), p ~ Beta(ao,6o), 7o ~ Gamma(eo, I//0) (21) 
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where marginally we have rik = J2j=i ^jk ~ NB (70 /iC, . Note that if J > 1, one need avoid 
marginalizing out in rijk ~ Pois(rjt), ~ Gamma(7o/-ft', 1) as rijk ~ NB(7o/-ft', 1/2). Denote 
r = J2k=i''^k, using Lemma IV. 1 we can equivalently express Nj and rijk in (21) as 



Nj ~ Pois (r) , (n,i, ■ ■ ■ , n,i^) ~ Muh (A^^-; ri/r, ■ ■ ■ , r^^/r) . (22) 

Since the data {xjj}j=i 7Vj are fully exchangeable, rather than drawing {uji, ■ ■ ■ ^Uj^) once, we 
may equivalently draw index Zji for each Xji and then calculate rijk as 

Zji ~ Discrete (ri/r, ■ ■ ■ , r^^/r) , rijk = T.f=i^i^ji = (23) 

This provides further insights on how the seemingly disjoint problems of count and mixture 
modeling are united under the NB process framework. Following ( [T9] ), the block Gibbs sampling 
is straightforward to write as 

Vx{zji = k\-) (X F{xji] ujk)rk, {p\-) ~ Beta (^Oq + Y.j=^ ^0 + ^0) 

(/fc|-) ~ CRT(nfc,7o/fs:), (70I-) ~ Gamma l^eo + Ef=i^fc. TbZ^iZ^^ 
(rfc|-) ~ Gamma(7o/-ft' + nfc,p/J) , p{uk\-) (xYl^..^^F{xji;Uk)go{ojk). (24) 

If (7o(i^) is conjugate to the likelihood F(x;uj), then the conditional posterior of u would be 
analytic. Note that when — )■ 00, we have {lk\—) = ^{^k > 0) and then J2k=i = ■ 



Without modeling Nj and r as random variables, we can re-express ( [2T[ ) as 

Xji ~ F(w^^J, Zji ~ Discrete(f), f ~ Dir(7o/i^, ■ ■ ■ , ^o/K), 70 ~ Gamma(eo, I//0) (25) 

which loses the count modeling ability and becomes a finite representation of the Dirichlet 
process [fTSl . [|29ll . The conditional posterior of r is analytic, whereas 70 can be sampled as in 
IfTTll when K ^ 00. This also implies that by using the Dirichlet process as the foundation, 
traditional mixture modeling may discard useful count information from the beginning. 

The Poisson process has an equal-dispersion assumption for count modeling. For mixture 
modeling of grouped data, the gamma-Poisson process augmented from the NB process might 



be too restrictive in that, as shown in (20), it implies the same mixture proportions across groups. 
This motivates us to consider adding an additional layer into the gamma-Poisson process or using 
a different distribution other than the Poisson to model the counts for grouped data. As shown in 



Section III the NB distribution is an ideal candidate, not only because it allows overdispersion. 
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but also because it can be equivalently augmented into a gamma-Poisson and a compound 
Poisson representations; moreover, it can be used together with the CRT distribution to form a 
Poisson-logarithmic bivariate distribution to jointly model the counts of customers and tables. 

V. Count and Mixture Modeling of Grouped Data 

For joint count and mixture modeling of grouped data, we explore sharing the NB dispersion 
while the probability parameters are group dependent. We construct a gamma-NB process as 

Xj ~ NBP(G,Pj), G ~ GaP(c, Go). (26) 

Note that we may also let Xj ~ NBP(a;jG',pj) and place a gamma prior on aj to increase model 
flexibility, whose inference will be slightly more complicated and thus omitted here for brevity. 
The gamma-NB process can be augmented as a gamma-gamma-Poisson process as 

Xj ~ PP(ej ) , ~ GaP( ( 1 - ) /pj , G) , G ~ GaP(c, Go) . (27) 

This construction introduces gamma random measures Qj based on G, which are essential to 
construct group-level probability measures Qj to assign observations to mixture components. 
The gamma-NB process can also be augmented under the compound Poisson representation as 

~ EfiiLog(Pj)> L.J ~ PP(-Gln(l - j9,)), G ~ GaP(c, Go) (28) 



which, using Corollary III.2[ is equivalent in distribution to 



L.j ~ CRTP(Xj, G), X, ~ NBP(G,Pj), G ~ GaP(c, Go). (29) 



Using Lemma III.3 and Corollary III. 2 we further have two equivalent augmentations as 



L - EtLi Log(p'), L' ~ PP(-Go ln(l - p')), p' = (30) 
L' ~ CRTP(L,Go), L~NBP(Go,p') (31) 

where L = '^jLj. These augmentations allow us to derive a sequence of closed-form update 
equations for inference with the gamma-NB process, as described below. 

A. Model Properties 

Let pj ~ Beta(ao, bo), using the beta NB conjugacy, we have 

{Pj\-) ~ Beta (ao + Xj(fi), bo + G{n)) . (32) 
Using (29) and (31), we have 

Lj\Xj,G ^CRTP{Xj,G), L'|L,Go ~ CRTP(L,Go). (33) 
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If Go is continuous and finite, we have Gq^u) W u E Q and thus L'{Q)\L,Gq = 
E.enm^) > 0) = E^en^iEjX.iuj) > 0) = K^; if Go is discrete as Go = Ef=i 
then L'{ujk) = CKT{L{uk),^) > 1 if J2j^jM > 0, thus L'(fi) > K+. In either case, let 
7o = Go(^^) ~ Gamma(eo, ^/ fo)^ with the gamma Poisson conjugacy on (28) and (30), we have 

^o\{L'{n),p'} ~ Gamma (eo + L'{n), ; (34) 

G|Go, {L„p^} ~ GaP (c - ln(l - p,), Gq + l) . (35) 



Using the gamma Poisson conjugacy on (27), we have 

0,|G, Xj,pj ~ GaP {1/pj, G + Xj) . (36) 
Since the data {xj,i}i are exchangeable within group j, the predictive distribution of a point Xji, 
conditioning on Xp = {Xjn}n:n^i and G, with Qj marginalized out, can be expressed as 

i ^ Ei0jiQ)\G,xp] G{n)+Xjin)-i ~^ Gin)+Xj(n)-i- v-''^ 

B. Relationship with the Hierarchical Dirichlet Process 

Based on Corollaries IV.2 and IV.4[ we can equivalently express (27) as 

Xj{Q) ~ Pois(^j), Oj ~ Gamma(a,pj/(1 - pj)) (38) 

Xj ~ MP{Xj{n), Bj), Bj ~ DP(a, G), a ~ Gamma(7o, 1/c), G ~ DP(7o, Go) (39) 

where Qj = 9jQj, G = aG and Go = 7oGo. Without modeling Xj(n) and 9j as random 
variables, (39) becomes an HDP [|2T|. Thus the augmented and then normalized gamma-NB 
process leads to an HDP However, we cannot return from the HDP to the gamma-NB process 
without modeling Xj(Q) and 9j as random variables. Theoretically, they are distinct in that the 
gamma-NB process is a completely random measure, assigning independent random variables 
into any disjoint Borel sets {Ag}i Q in ^l, and the count Xj(A) has the distribution as Xj(A) ~ 
NB{G{A),pj); by contrast, due to normalization, the HDP is not, and marginally 

Xj{A) ~ Beta-Binomial(Xj(fi),aG(A),a(l - G{A))). (40) 
Practically, the gamma-NB process can exploit conjugacy to achieve analytical conditional 
posteriors for all latent parameters. The inference of the HDP is a challenge and it is usually 
solved through alternative constructions such as the Chinese restaurant franchise (CRF) and 
stick-breaking representations ^2\\. [|23l . In particular, without analytical conditional posteriors, 
the inference of concentration parameters a and 70 is nontrivial [1211 . [|22ll and they are often 
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simply fixed [|23ll . Under the CRF metaphor a governs the random number of tables occupied 
by customers in each restaurant independently; further, if the base probability measure Gq 
continuous, 70 governs the random number of dishes selected by tables of all restaurants. One 
may apply the data augmentation method of [|T7| to sample a and 70. However, if Gq is discrete 
as Go = X]fe=i i<_^^k^ which is of practical value and becomes a continuous base measure as 
K ^ 00 [|29ll . (211, [[22|. then using the method of [[TTl to sample 70 is only approximately 
correct, which may result in a biased estimate in practice, especially if K is not large enough. 

By contrast, in the gamma-NB process, the shared gamma process G can be analytically 
updated with ([35]) and G{Vi) plays the role of a in the HDP, which is readily available as 



G(fi)|Go, {L„p,},=i,^ ~ Gamma(7o + L.in), ,_y.M^-p,) ) ^"^^^ 
and as in ( [34] ), regardless of whether the base measure is continuous, the total mass 70 has an 
analytical gamma posterior whose shape parameter is governed by L'(Vt), with ViVt) = if 



Go is continuous and finite and L'{Vt) > if Go = Yl!k=i '^^'^k - Equation (41 ) also intuitively 
shows how the NB probability parameters {pj} govern the variations among {Aj} in the gamma- 
NB process. In the HDP, pj is not explicitly modeled, and since its value becomes irrelevant when 
taking the normalized constructions in ( [39[ ), it is usually treated as a nuisance parameter and 
perceived as pj = 0.5 when needed for interpretation purpose. Fixing pj = 0.5 is also considered 
in [|48ll to construct an HDP, whose group-level DPs are normalized from gamma processes with 
the scale parameters as = 1; it is also shown in [48] that improved performance can be 
obtained for topic modeling by learning the scale parameters with a log Gaussian process prior. 
However, no analytical conditional posteriors are provided and Gibbs sampling is not considered 
as a viable option [|48l . 



C. Block Gibbs Sampling for the Gamma-Negative Binomial Process 



As with the NB process described in Section IV with a discrete base measure as Go = 
^k=i 'K^'^k^ '^^^ express the gamma-NB process as 

Xji ~ F{uj^^^), Uk ~ goiuJk), Nj = Y.k=i'^jk, ~ Pois(6'jA:), djk ~ Gamma(rfe,]9j/(1 - pj)) 

Tk ~ Gamma(7o/i^, 1/c), pj ~ Beta(ao, 60), 7o ~ Gamma(eo, I//0) (42) 



where marginally we have rijk ~ NB(rfc,p-,). Following Section V-A the block Gibbs sampling 



for (42) is straightforward to write as 
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Pr{zji = k\-) (X F{xji] Uk)9jk, {ljk\-) ~ CRT(njfc, r^), (^1-) ~ CRT Ijk, 7o/^) 
{p,H ~ Beta (ao + iV, , 60 + , P' = ^^^S^) 



(70I-) ~ Gamma (^cq + Efc 



1 



'fc' /o-ln(l-p') ^ 

(rfcl-) ~ Gamma + J2j hk, e-E, in(i- P,), 

~ Gamma(rfc + njk,Pj), p{uJk\-) oc [1^^,=?= -^(a^ii; Wfc)5fo(c<;fc) (43) 

which has similar computational complexity as that of the direct assignment block Gibbs sam- 
pling of the CRF-HDP [EU, Note that when K ^ 00, we have (/^|-) = 6{^j Ijk > 0) = 
SiEj^jk > 0) and thus = K^- 

Without treating Nj and 6j as random variables, we can re-express (42) as 



Zji ~ Discrete(0j), 6j ~ Dir(ar), a ~ Gamma(7o, 1/c), r ~ Dir(7o/-ft', ■ ■ ■ ^^jq/K) (44) 

which becomes a finite representation of the HDP, the inference of which is usually solved under 
the Chinese restaurant franchise ETI . [|22ll or stick-breaking representations [|23l . 

VI. The Negative Binomial Process Family 
The gamma-NB process shares the NB dispersion across groups while the NB probability 
parameters are group dependent. Since the NB distribution has two adjustable parameters, we 
may explore alternative ideas, with the NB probability measure shared across groups as in [13], 
or with both the dispersion and probability measures shared as in [7J. These constructions are 
distinct from both the gamma-NB process and HDP in that 6j has space dependent scales, and 
thus its normalization 9^ = 6j/6j(f2), which is still a probabihty measure, no longer follows 
a Dirichlet process. 

It is natural to let the NB probability measure be drawn from the beta process ll28ll . [|27l . A 
beta-NB process 0, jBl can be constructed by letting Xj ~ NBP(rj, 5), B ~ BP(c, Bq), with 
a random draw expressed as Xj = YlT=i ^jk^ojki "^jk ~ NB(rj,pfc). Under this construction, the 
NB probability measure is shared and the NB dispersion parameters are group dependent. Note 
that if Tj are fixed as one, then the beta-NB process reduces to the beta-geometric process, related 
to the one for count modeling discussed in JIIll; if Vj are empirically set to some other values, then 
the beta-NB process reduces to the one proposed in [fT3l . These simplifications are not considered 
in the paper, as they are often overly restrictive. As in [T], we may also consider a marked- 
beta-NB process, with both the NB probability and dispersion measures shared, in which each 
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point of the beta process is marked with an independent gamma random variable. Thus a draw 
from the marked-beta process becomes (R, B) = J2'h=ii^k,Pk)Sujk^ '^he NB process Xj ~ 
NBP(_R, B) becomes Xj = Yl'kLi 'i^-jk^ui^, njk ~ NB(rfc,pfc). Since the beta and NB distributions 
are conjugate, the posterior of B is tractable, as shown in |fT3l . Similar to the marked-beta-NB 
process, we may also consider a beta marked-gamma-NB process, whose performance is found 
to be similar. If it is believed that there are excessive number of zeros, governed by a process 
other than the NB process, we may introduce a zero inflated NB process as Xj ~ NBF(RZj,pj), 
where Zj ~ BeP(i?) is drawn from the Bernoulli process ll27l and {R, B) = Yl'h=ii''"ki'^k)Sujk 
is drawn from a marked-beta process, thus rijk ~ NB{rkbjk,Pj), bjk = Bernoulli(7rfc). This 
construction can be linked to the model in [|49l with appropriate normalization, with advantages 
that there is no need to fix pj = 0.5 and the inference is fully tractable. The zero inflated 
construction can also be linked to models for real valued data using the Indian buffet process 
(IBP) or beta-Bernoulli process spike-and-slab prior [|50l, [|5T1l, [|52]|. ||53]|, [|54||, [|55ll. [|56ll. [|57]|. 
More details on the NB process family can be found in Appendix B. 

VII. Negative Binomial Process Topic Modeling and 
PoissoN Factor Analysis 
We consider topic modeling (mixed membership modeling) of a document corpus, a special 
case of mixture modeling of grouped data, where the words of the jth document Xji, ■ ■ ■ , Xj^j 
constitutes a group Xj (Nj words in document j), each word Xji is an exchangeable group member 
indexed by vji in a vocabulary with V unique terms. The likelihood F(xji; cj)^ is simply 0^,^-^, 
the probability of word Xji under topic (atom/factor) (f)^. E MX , with Ylv=i = 1- We refer to 
NB process mixture modeling of grouped words {ajjji j as NB process topic modeling. 

Denote n^jk = Ylf=i^i^ji = ^^'^ji = ^jk = Hv^^jk^ ^^-k = Y.j^vjk and n.k = Y^j^jk, 
and for modeling convenience, we place Dirichlet priors on topics 0^ ~ Dir(r7, ■ ■ ■ ,77), then for 



block Gibbs sampling of the gamma-NB process in (43) with K atoms, we have 

Pr(^.. = fe|-)= (45) 

i(pk\-) ~Dir(?7 + ni.fc,-- - ,?7 + ny.fc) (46) 

which would be the same for the other NB processes, since the gamma-NB process differs from 
them on how the gamma priors of 9jk and consequently the NB priors of njk are constituted. 
For example, marginalizing out 9jk, we have rijk ~ NB(rk,Pj) for the gamma-NB process. 
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rijk ~ NB(rj,pjfc) for the beta-NB process, rijk ~ NB(rfc,pfc) for both the marked-beta-NB and 
marked-gamma-NB processes, and njk ~ NB{rkbjk,Pj) for the zero-inflated-NB process. 

Since in topic modeling the majority of computation is spent updating Zji, 4>f. and 9jk, the 
proposed Bayesian nonparametric models pay a small amount of additional computation, relative 
to parametric ones such as latent Dirichlet allocation (LDA) [|5J, for updating other parameters. 

A. Poisson Factor Analysis 

Note that under the bag-of-words representation (the ordering of words in a document is not 
considered), without losing information, we can form {xj}i j as a term-document count matrix 
M G M^^"', where niyj counts the number of times term v appears in document j. Given 
K < oo and a count matrix M, discrete latent variable models assume that each entry m^j can 
be explained as a sum of smaller counts, each produced by one of the K hidden factors, or in 
the case of topic modeling, a hidden topic. We can factorize M under the Poisson likelihood as 

M = Pois($e) (47) 
where $ G MY^^ is the factor loading matrix, each column of which is an atom encoding 
the relative importance of each term; © G M^^"' is the factor score matrix, each column of 
which encodes the relative importance of each atom in a sample. This is called Poisson Factor 
Analysis (PFA). We may augment ra^j ~ Pois(X]A^i 4>vk0jk) as 

^vj = Y.k=i ^vjk, nyjk ~ Pois{(f)ykOjk)- (48) 



IV. 1 



we also have 



and if Ylv=i = 1, we have Ujk ~ Pois(6'jfc), and with Lemma 

(n,, . . . , ~ Mul. (™,; .... j^g^) (49) 

(n^.i,--- ,ra^./r|-) ~ Muh(n.fc;0;,), (riji,-- - , n^xl-) ~ Muh (A/^; 0^) (50) 



where (49) would lead to (45) under the assumption that the words {xji}i are exchangeable and 



(50) would lead to (46) if 0^ ~ Dir(r7, ■ ■ • ,r]). Thus the NB process topic modeling can be 
considered as factorization of the term-document count matrix under the Poisson likelihood as 
M ~ Pois(#©), with the requirement that Y^=i 'Pvk = 1 (implying rijk ~ Pois(6'jfc)). 

B. Related Discrete Latent Variable Models 

We show below that previously proposed discrete latent variable models can be connected 
under the PFA framework, with the differences mainly on how the priors of (p^k and Ojk are 
constituted and how the inferences are implemented. 
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1) Latent Dirichlet Allocation: We can construct a Dirichlet-PFA (Dir-PFA) by imposing 

Dirichlet priors on both 0^, = {(pik, ■ ■ ■ , (l)vkY ^^^^ = {^jii ' i ^jkY 

0,~Dir(r/,--- ,r/), 0, ~ Dir (a/iT, ■ ■ ■ , ) . (51) 



Sampling 2:^4 with (45), which is the same as sampling n^jt with (49), and using (50) with the 
Dirichlet multinomial conjugacy, we have 

~Dir(?7 + ni.fc,-- - ,?7 + ny.fc), {6j\-) ~ Dir (a/i^ + n^i, ■ ■ ■ ,a/K + njK)- (52) 



Using variational Bayes (VB) inference [1581 . 115911 , we can approximate the posterior distribution 
with the product of Qk.i,- ,n,jK) = Muh (^m^^; Ci-ji, ■ ■ ■ ,Ci>ix), = Dir(a0ife,-- - ,0^^^) 
and Qflj = Dir (agji, ■ ■ ■ , a^jx) for = 1, ■ ■ ■ , V^, j = 1, ■ ■ ■ , J and /c = 1, ■ ■ ■ , i^T, where 

; _ exp((ln(^„fc) + (lnejfc)) 

a</.,;A; = + agjfc = a/fC + (n^-fc); (54) 

these moments are calculated as (ln0^fc) = ipia^vk) - (Zl!/=i ^^t^'fc)' (In^'jfc) = i^iagjk) - 
V' (^k'=i ^ejk'^ ^^'^ {f^vjk) = TTT'vjCvjk, whcrc ip{x) is the diagmma function. Therefore, Dir- 
PFA and LDA Q, II6OII have the same block Gibbs sampling and VB inference. It may appear 
that Dir-PFA should differ from LDA via the Poisson distribution; however, imposing Dirichlet 
priors on both factor loadings and scores makes it essentially loose that distinction. 

2) Nonnegative Matrix Factorization and a Gamma-Poisson Factor Model: We can construct 
a Gamma-PFA (F-PFA) by imposing gamma priors on both cp^k and Ojk as 

(t)vk ~ Gamma(a0, l/6</,), 9jk ~ Gamma(a0, fiffc/a^). (55) 

Note that if we set 6^ = 0, a<^ = = 1 and = 00, then we are imposing no priors on 
and 6jk, and a maximum a posterior (MAP) estimate of F-PFA would become an ML estimate 
of PFA. Using ( [48] ) and ( [55] ), one can show that 

{(l)vk\-) ~ Gamma(a0 + n^.^, l/{b^ + 6.^)) (56) 

{Ojk\-) ~ Gamma(a9 + rijk, l/{ae/gk + (p.k)) (57) 



where O.k = ZlLi ^jfc, (p-k = YZ=i^vk- If a</, > 1 and ae > 1, using (|49|), (|56|) and (|57|), we 



■'3 



can substitute E[?7,^,,fc|0t,A;, 6',^] = "^^'^"''^j'' into the modes of (f)pk and 6ki, leading to a MAP 
Expectation-Maximization (MAP-EM) algorithm as 

^..=A. °-' cs"'-" . g.^=g.^ '" (58) 
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If we set = 0, a^p = ag = 1 and gk = oo, the MAP-EM algorithm reduces to the ML-EM 
algorithm, which is found to be the same as that of non-negative matrix factorization (NMF) 
with an objective function of minimizing the KL divergence D/<i(M| |$0) [l6Tl. If we set 



= and = 1, then the update equations in (58) are the same as those of the gamma- 
Poisson (Gap) model of [6J, in which setting ag = 1.1 and estimating gk with g^ = ^[9jk] 
are suggested. Since all latent variables are in the exponential family with conjugate update, 



following the VB inference for Dir-PFA in Section VII-B 1 , we can conveniently derive the VB 
inference for F-PFA, omitted here for brevity. Note that the inference for the basic gamma- 
Poisson model and its variations have also been discussed in detail in [|62l . Il63l . Here we 



show using Lemma IV. 1 the derivations of the ML-EM, MAP- EM, Gibbs sampling and VB 
inference are all straightforward. The NMF has been widely studied and applied to numerous 
applications, such as image processing and music analysis (HI, [|64l . Showing its connections 
to NB process topic modeling, under the Poisson factor analysis framework, may allow us to 
extend the proposed nonparametric Bayesian techniques to these broad applications. 
C. Negative Binomial Process Topic Modeling 

From the point view of PEA, a NB process topic model factorizes the count matrix under the 
constraints that each factor sums to one and the factor scores are gamma distributed random 
variables, and consequently, the number of words assigned to a topic (factor/atom) follows a NB 
distribution. Depending on how the NB distributions are parameterized, as shown in Table |I[ we 
can construct a variety of NB process topic models, which can also be connected to previous 
parametric and nonparametric topic models. For a deeper understanding on how the counts are 
modeled, we also show in Table |I] both the VMR and ODL implied by these settings. 

We consider eight differently constructed NB processes in Table |lj (z) The NB process 



described in (21) is used for topic modeling. It improves over the count-modeling gamma- 
Poisson process discussed in [flOll , [|TT]| in that it unites mixture modeling and has closed-form 
Bayesian inference. Although this is a nonparametric model supporting an infinite number of 
topics, requiring {9jk}j=i,j = Vk may be too restrictive. (//) Related to LDA ^ and Dir-PFA [7], 
the NB-LDA is also a parametric topic model that requires tuning the number of topics. However, 
it uses a document dependent and pj to automatically learn the smoothing of the gamma 
distributed topic weights, and it lets Vj ~ Gamma(7o, 1/c), 70 ~ Gamma(eo, I//0) to share 
statistical strength between documents, with closed-form Gibbs sampling. Thus even the most 
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TABLE I 

A VARIETY OF NEGATIVE BINOMIAL PROCESSES ARE CONSTRUCTED WITH DISTINCT SHARING MECHANISMS, 
REFLECTED WITH WHICH PARAMETERS FROM 9jk, Tk, Tj, pk, Pj AND TTfe (bjk) ARE INFERRED (INDICATED BY A 

check-mark /), and the implied vmr and odl for counts {njk}j,k- they are applied for topic 
modeling of a document corpus, a typical example of mixture modeling of grouped data. 
Related algorithms are shown in the last column. 



Algorithms 




Tk 




Pk 




TTfc 


VMR 


ODL 


Related Algorithms 


NB 


djk = Tk 


/ 












-1 


Gamma-Poisson 1101. 1111 


NB-LDA 


/ 




/ 




/ 






-1 


LDA 1 5 1, Dir-PFA |7| 


NB-HDP 


/ 


/ 






0.5 




2 


r-k' 


HDPim. DILN-HDP |48| 


nb-ftm 


/ 


/ 






0.5 


/ 


2 


{rk)-%k 


FTM EH, SiL-PFA (7| 


Beta-Geometric 


/ 




1 


/ 






{l-Pk)-' 


1 


Beta-Geometric iTT], BNBP Q, (HI 


Beta-NB 


/ 




/ 


/ 






{l-Pk)-' 




BNBP |7J, 113J 


Gamma-NB 


/ 


/ 






/ 






r-k' 


CRF-HDP (21] , (22) 


Marlced-Beta-NB 


/ 


/ 




/ 








r-k' 


BNBP (3 



basic parametric LDA topic model can be improved under the NB count modeling framework. 
{Hi) The NB-HDP model is related to the HDP [|2T|. and since pj is an irrelevant parameter in the 
HDP due to normalization, we set it in the NB-HDP as 0.5, the usually perceived value before 
normalization. The NB-HDP model is comparable to the DILN-HDP fA^] that constructs the 
group-level DPs with normalized gamma processes, whose scale parameters are also set as one. 
{iv) The NB-FTM model introduces an additional beta-Bernoulli process under the NB process 
framework to explicitly model zero counts. It is the same as the sparse-gamma-gamma- PFA (S7r- 
PFA) in [7] and is comparable to the focused topic model (FTM) [|49l . which is constructed 
from the IBP compound Dirichlet process. The Zero-Inflated-NB process improves over these 
approaches by allowing pj to be inferred, which generally yields better data fitting, (v) The 
Gamma-NB process explores sharing the NB dispersion measure across groups, and it improves 
over the NB-HDP by allowing the learning of pj. It reduces to the HDP [1211 by normalizing both 
the group-level and the shared gamma processes, (v/) The Beta-Geometric process explores the 
idea that the probabihty measure is shared across groups, which is related to the one proposed 
for count modeling in [fTT]|. It is restrictive that the NB dispersion parameters are fixed as one. 
{vii) The Beta-NB process explores sharing the NB probability measure across groups, and 
it improves over the Beta-Geometric process and the beta negative binomial process (BNBP) 
proposed in [13|, allowing inference of r^. (vzzz) The Marked-Beta-NB process is comparable 
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to the BNBP proposed in [|71, with the distinction that it allows analytical update of r^. The 
constructions and inference of various NB processes and related algorithms in Table |l] all follow 



the formulas in (|42|) and (43), respectively, with additional details presented in Appendix B. 



Note that as analyzed in Section VIT NB process topic models can also be considered as 
factor analysis of the term-document count matrix under the Poisson likelihood, with 0^ as the 
kth factor that sums to one and 9jk as the factor score of the jth document on the kth factor, 
which can be further linked to nonnegative matrix factorization [1611 and a gamma Poisson factor 
model If except for proportions Xj and f, the absolute values, e.g., 6jk, and pk, are also of 
interest, then the NB processes based joint count and mixture models would be more appropriate 
than the Dirichlet process and the HDP based mixture models. 

VIIL Example Results 

Motivated by Table |I} we consider topic modeling using a variety of NB processes, which 
differ on how the NB dispersion and probability parameters of the latent counts {njk}j^k are 
learned and consequently how the VMR and ODL are modeled. We compare them with LDA 
[El, [[65l[ and CRF-HDP [|2TI . [[221 . in which the latent count Uj^ is marginally distributed as 

rijk ~ Beta-Binomia^A^j-, af^, a(l — fk)) (59) 

with ffc fixed as 1/K in LDA and learned from the data in CRF-HDP. For fair comparison, they 
are all implemented with block Gibbs sampling using a discrete base measure with K atoms, 
and for the first fifty iterations, the Gamma-NB process with = 50 /K and pj = 0.5 is used 
for initialization. We consider 2500 Gibbs sampling iterations and collect the last 1500 samples. 

We consider the Psychological Review}^ corpus, restricting the vocabulary to terms that occur in 
five or more documents. The corpus includes 1281 abstracts from 1967 to 2003, with V = 2566 
and 71,279 total word counts. We randomly select 20%, 40%, 60% or 80% of the words from 
each document to learn a document dependent probability for each term v and calculate the 
per-word perplexity on the held-out words as 

Perplexity = exp (- ^ E/=i ELi Vjv log , = /^^y^^^^., (60) 



http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm 
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Fig. 1. Distinct stiaring mechanisms and model properties are evident between various NB process topic models, by comparing 
their inferred NB dispersion and probability parameters. Note that the transition between active and non-active topics is very 
sharp when pk is used and much smoother when is used. Both the documents and topics are ordered in a decreasing order 
based on the number of words associated with each of them. These results are based on the last Gibbs sampling iteration, on 
the Psychological Review corpus with 80% of the words in each document used as training. The values are shown in either 
linear or log scales for convenient visualization. 



where yjy is the number of words held out at term v in document j, y.. = J2'i=i Ylv=i Vj^ 



the total number of held-out words, and s 



S are the indices of collected samples. Note 



that the per-word perplexity is equal to V if fjv = l/V, thus it should be no greater than V 
for a functional topic model. The final results are averaged from five random training/testing 
partitions. The performance measure is the same as in [|7]| and also similar to those used in {6E\, 
ll67ll . [|23l . Note that the perplexity per held-out word is a fair metric to compare topic models. 
However, when the actual Poisson rates or NB distribution parameters for counts instead of the 
mixture proportions are of interest, a NB process based joint count and mixture model would 
be more appropriate than a Dirichlet process or an HDP based mixture model. 

We show in Fig.[T]the NB dispersion and probability parameters learned by various NB process 
topic models listed in Table |I} revealing distinct sharing mechanisms and model properties. In 
Fig. [2] we compare the per-held-out-word prediction performance of various algorithms. We set 
the parameters as c = 1, 77 = 0.05 and oq = &o = eo = /o = 0.01. For LDA and NB-LDA, we 
search K for optimal performance and for the others, we set K = 400 as an upper-bound. All 
the other NB process topic models are nonparametric Bayesian algorithms that can automatically 
learn the number of active topics for a given corpus. When 9jk = rk is used, as in the NB 
process, different documents are imposed to have the same topic weights, leading to the worst 
held-out-prediction performance. 
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Fig. 2. Comparison of per-word perplexity on held out words between various algorithms listed in Table|l]on the Psychological 
Review corpus. LDA-Optimal-a refers to an LDA algorithm whose topic proportion Dirichlet concentration parameter a is 
optimized based on the results of the CRF-HDP on the same dataset. (a) With 60% of the words in each document used for 
training, the performance varies as a function of K in both LDA and NB-LDA, which are parametric models, whereas the 
NB, Beta-Geometric, NB-HDP, NB-FTM, Beta-NB, CRF-HDP, Gamma-NB and Marked-Beta-NB all infer the number of active 
topics, which are 225, 28, 127, 201, 107, 161, 177 and 130, respectively, according to the last Gibbs sampling iteration, (b) 
Per-word perplexities of various algorithms as a function of the percentage of words in each document used for training. The 
results of LDA and NB-LDA are shown with the best settings of K under each training/testing partition. Nonparametric Bayesian 
algorithms listed in Table |l] are ranked in the legend from top to bottom according to their overall performance. 



With a symmetric Dirichlet prior Dir(a/i^, ■ ■ ■ ,a/K) placed on the topic proportion for each 
document, the parametric LDA is found to be sensitive to both the number of topics K and the 
value of the concentration parameter a. We consider a = 50, following the suggestion of the 
topic model toolbox!^ provided for [[65l : we also consider an optimized value as a = 2.5, based 
on the results of the CRF-HDP on the same dataset. As shown in Fig. |2} when the number 
of training words is small, with optimized K and a, the parametric LDA can approach the 
performance of the nonparametric CRF-HDP; as the number of training words increases, the 
advantage of learning f^. in the CRF-HDP than fixing f^. = 1/K in LDA becomes clearer. The 
concentration parameter a is important for both LDA and CRF-HDP since it controls the VMR 
of the count rijk, which is equal to (1 — ffc)(a + Nj)/(a + 1) based on (59). Thus fixing a 



may lead to significantly under- or overestimated variations and then degraded performance, as 
evident by comparing the performance of LDA with a = 50 and LDA-Optima-a in Fig. |2j 
When {rj,pj) is used, as in NB-LDA, different documents are weakly coupled with rj ~ 
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Gamma(7o, ^Ic), and the modeling results in Fig. [T] show that a typical document in this corpus 
usually has a small and a large pj, thus a large ODL and a large VMR, indicating highly 
overdispersed counts on its topic usage. NB-LDA is a parametric topic model that requires tuning 
the number of topics K. It improves over LDA in that it only has to tune K, whereas LDA 
has to tune both K and a. With an appropriate K, the parametric NB-LDA may outperform the 
nonparametric NB-HDP and NB-FTM as the training data percentage increases, showing that 
even by learning both the NB dispersion and probability parameters Vj and pj in a document 
dependent manner, we may get better data fitting than using nonparametric models that share 
the NB dispersion parameters across documents, but fix the NB probability parameters. 

When {rj,pk) is used to model the latent counts {njk}j^k, as in the Beta-NB process, the 
transition between active and non-active topics is very sharp that p^ is either far from zero 
or almost zero. That is because pk controls the mean as EE^^jA;] = ~ PkjYlj^j ^^'^ 

the VMR as (1 — Pk)^^ on topic k, thus a popular topic must also have large pk and thus 
large overdispersion measured by the VMR; since the counts {njk}j are usually overdispersed, 
particularly true in this corpus, a small pk indicating an small mean and small overdispersion is 
not favored by the model and thus is rarely observed. 

The Beta-Geometric process is a special case of the Beta-NB process that Vj = 1, which is 
more than ten times larger than the values inferred by the Beta-NB process on this corpus, as 
shown in Fig. |T| therefore, to fit the mean IE[X]j "^jk] = JPk/ {^—Pk), it has to use a substantially 
underestimated pk, leading to severely underestimated variations and thus degraded performance, 
as confirmed by comparing the curves of the Beta-Geometric and Beta-NB processes in Fig. |2j 

When {rk,Pj) is used, as in the Gamma-NB process, the transition is much smoother that 
Tfc gradually decreases. The reason is that controls the mean as EfXlj^jfe] = ''"kJ2jPj/i^ " 
Pj) and the ODL r^^ on topic k, thus popular topics must also have large and thus small 
overdispersion measured by the ODL, and unpopular topics are modeled with small and 
thus large overdispersion, allowing rarely and lightly used topics. Therefore, we can expect that 
{rkiPj) would allow more topics than (rj,pk), as confirmed in Fig. |2] (a) that the Gamma-NB 
process learns 177 active topics, significantly more than the 107 ones of the Beta-NB process. 
With these analysis, we can conclude that the mean and the amount of overdispersion (measure by 
the VMR or ODL) for the usage of topic k is positively correlated under (rj,pk) and negatively 
correlated under (rk^pj). 
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The NB-HDP is a special case of the Gamma- NB process that pj = 0.5. From a mixture 
modeling viewpoint, fixing pj = 0.5 is a natural choice as pj becomes irrelevant after normal- 
ization. However, from a count modeling viewpoint, this would make a restrictive assumption 



that each count vector has the same VMR of 2. It is also interesting to examine (41 1, 

which can be viewed as the concentration parameter a in the HDP, allowing the adjustment of 
Pj would allow a more flexible model assumption on the amount of variations between the topic 
proportions, and thus potentially better data fitting. 

The CRF-HDP and Gamma-NB process have very similar performance on predicting held- 
out words, although they have distinct assumption on count modeling: Ujk is modeled as a NB 
distribution in the Gamma-NB process while it is modeled as a beta-binomial distribution in the 
CRF-HDP. The Gamma-NB process adjust both and pj to fit the NB distribution, whereas the 
CRF-HDP learns both a and to fit the beta-binomial distribution. The concentration parameter 



a controls the VMR of the count rijk as shown in (59), and we find through experiments that 
prefixing its value may substantially degrade the performance of the CRF-HDP, thus this option 
is not considered in the paper and we exploit the CRF metaphor to update a as in [|2T|. (23. 

When (rfc, tt^) is used, as in the NB-FTM model, our results show that we usually have a 
small TTfe and a large r^, indicating topic k is sparsely used across the documents but once it is 
used, the amount of variation on usage is small. This modeling properties might be helpful when 
there are excessive number of zeros which might not be well modeled by the NB process alone. 
In our experiments, we find the more direct approaches of using p^ or pj generally yield better 
results, but this might not be the case when excessive number of zeros are better explained with 
the underlying beta-Bernoulli processes, e.g., when the training words are scarce, the NB-HDP 
can approach the performance of the Marked-Beta- NB process. 

When (rfc,pfc) is used, as in the Marked-Beta-NB process, more diverse combinations of 
mean and overdispersion would be allowed as both and pk are now responsible for the 
mean IE[X]j "^jfc] = J^kPk/{^ ~Pk)- For example, there could be not only large mean and small 
overdispersion (large and small p^), indicating a popular topic frequently used by most of the 
documents, but also large mean and large overdispersion (small r/c and large pk), indicating a 
topic heavily used in a relatively small percentage of documents. Thus {vk^pk) may combine the 
advantages of using only or pk to model topic k, as confirmed by the superior performance 
of the Marked-Beta-NB over the Beta-NB and Gamma-NB processes. 
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IX. Conclusions 

We propose a variety of negative binomial (NB) processes for count modeling, which can 
be naturally applied for the seemingly disjoint problem of mixture modeling. The proposed 
NB processes are completely random measures, which assign independent random variables to 
disjoint Borel sets of the measure space, as opposed to the Dirichlet process and the hierarchical 
Dirichlet process (HDP), whose measures on disjoint Borel sets are negatively correlated. We 
reveal connections between various distributions and discover unique data augmentation methods 
for the NB distribution, with which we are able to unite count and mixture modeling, analyze 
fundamental model properties, and derive efficient Bayesian inference using Gibbs sampling. 
We demonstrate that the NB process and the gamma-NB process can be normalized to produce 
the Dirichlet process and the HDP, respectively. We show in detail the theoretical, structural 
and computational advantages of the NB process. We examine the distinct sharing mechanisms 
and model properties of various NB processes, with connections made to existing discrete latent 
variable models under the Poisson factor analysis framework. Experimental results on topic mod- 
eling show the importance of modeling both the NB dispersion and probability parameters, which 
respectively govern the overdispersion level and variance-to-mean ratio for count modeling. 
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Appendix A 
Chinese Restaurant Table Distribution 

Lemma A.l. A CRT random variable I ~ CRT(m, r) with PMF fLillm, r) — f^^^\s{m, I) |r^, I 
0, 1, • • • ,m can be generated as 

I = > bn, bn ~ Bernoulli I ) . (61) 

\n-l + r J 

n=l ^ ' 

Proof: Since / is the summation of independent Bernoulli random variables, its PGF becomes 

Cl{z) = nr=i + ^■^) = fSo ELo mrz)'. Thus we have A(/|m,r) = 

^ = rgokKOkS / = 0,l,---,m. ■ 



Corollary A.2. If fL{l\m,r) = Y^^\s{m,l)\r\ / = 0, 1, ■ ■ ■ ,m, i.e. I ~ CRT(m,r), then 

E[l\m,r]=J2 Var[/|m,r] = ^ 7 (^2) 

n=l n=l ^ ' 

fl«<i approximately we have the mean and variance as 

Although / ~ CRT(m, r) can be generated as the summation of independent Bernoulli random 
variables, it may be desirable to directly calculate out its PMF in some case. However, it is 
numerically instable to recursively calculate the unsigned Stirling numbers of the first kind 
\s{m,l)\ based on \s{m,l)\ — {m — l)\s{m — + \s{m — 1,Z — 1)|, as \s{m,l)\ would 
rapidly reach the maximum value allowed by a finite precision machine as m increases. Denote 
Pr{m,l) = p^^|s(m, Z)|r', then is a probabiUty matrix as a function of r, each row of 
which sums to one, with Pr{0, 0) = 1, Pr{m, 0) = if m > and Pr{m, I) = if I > m. We 
propose to calculate Pr{m, I) under the logarithmic scale based on 

In P^(m,Z) = lnPi(m,0 + nn(r) +lnr(r) - lnr(m + r) + lnr(m + 1) (64) 

where In Pi (m, /) is iteratively calculated with In Pi (m, 1) = In ^^+lnPi(m— 1, l),lnPi(m,/) - 
In ^ + lnPi(m - 1, + ln(l + exp(lnPi(m - 1, Z - 1) - In Pi (m - 1, 1) - ln(m - 1))) for 
2 < I < m — 1, and lnPi(m, m) = In Pi (m — 1, m — 1) — Inm. This approach is found to be 
numerically stable, but it requires calculating and storing the matrix Pi, which would be time 
and memory consuming when m is large. 
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Appendix B 

Model and Inference for Negative Binomial Process Topic Models 
A. CRF-HDP 

The CRF-HDP model [7, 26] is constructed as 

Xji - (f)^^ - Dir(7;, • • • , 77), Zji ^ Discrete(Aj) 

Aj~Dir(Q;f), a ~ Gamma(ao, l/&o), ^ ~ Dir(7o/-fr, • • • , 70/i^). (65) 

Under the CRP metaphor, denote Ujk as the number of customers eating dish k in restaurant 
j and Ijk as the number of tables serving dish k in restaurant j, the direct assignment block 
Gibbs sampling can be expressed as 



^r{zji = k\-) (X (pvjik^k 

{ljk\—) ~ CRT(njfe, afk), wj ~ Beta(Q; + 1, Nj), Sj ~ Bernoulli 



a ~ Gamma 



/ J K J ^ \ 

\ j=i k=i j=i ^ ^3 3 J 



iV, 



TV,- + Q! 



(J J 

(Aj|-) ~ Dir (afi + riji, • • • , avK + riji^r) 

(</)jt|-) ~ Dir (r? + ni.jk, • • • , + ny.fe) . (66) 
When ^ 00, the concentration parameter 70 can be sampled as 

I ^x^, ^ eo + K+-l 
^o~Beta 70 + 1,2^2^ , t^^ = - . . ^00 , 

7o ~ TToGamma f cq + , — \ ) + (1 - 7ro)Gamma ( cq + - 1, \ ] (67) 

\ /o-lntwo/ V /o-lnwo/ 

where is the number of used atoms. Since it is infeasible in practice to let X — > 00, directly 
using this method to sample 70 is only approximately correct, which may result in a biased 
estimate especially if K is not set large enough. Thus in the experiments, we do not sample 

7o and fix it as one. Note that for implementation convenience, it is also common to fix the 
concentration parameter a as one [25]. We find through experiments that learning this parameter 
usually results in obviously lower per-word perplexity for held out words, thus we allow the 
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learning of a using the data augmentation method proposed in [7], which is modified from the 
one proposed in [24]. 

B. NB-LDA 

The NB-LDA model is constructed as 

Xji - ~ Dir(77, • • • , 77) 

K 

^3 = '^jk-, rijk ~ Pois(^jfc), ^jfc ~ Gamma(rj,pj/(1 - pj)) 

k=l 

Vj ~ Gamma(7o, 1/c), ~ Beta(ao, bo), 70 ~ Gamma(eo, I//0) (68) 

Note that letting rj ~ Gamma(7o, 1/c), 70 ~ Gamma(eo, I//0) allows different documents to 
share statistical strength for inferring their NB dispersion parameters. 
The block Gibbs sampling can be expressed as 

—K\n(l — n) 

{pj I -) - Beta (ao + A^^-, bo + Krj) , p' - v Fjy 



c — i^ln(l — Pj) 

K 



{ljk\-) ~ CRT{njk, Tj), Ij ~ CRT(^ ljk,1o), 7o ~ Gamma Uo + ^ I'j, 



k=l 

K 



i "/o-e;=iMi-p;) 



(rj|-) ~ Gamma |^7o + Z:,^, ^ - inn(i -p ) j ' ^ Gamma(rj + Ujk.Pj) 

{(f)k\-) ~Dir(?7 + ni.fe,-- - ,r] + nv.k)- (69) 

C. NB-HDP 

The NB-HDP model is a special case of the Gamma-NB process model with pj = 0.5. The 
hierarchical model and inference for the Gamma-NB process are shown in (42) and (43) of the 
main paper, respectively. 
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D. NB-FTM 

The NB-FTM model is a special case of zero-inflated NB process with pj = 0.5, which is 
constructed as 

Xji ~ F(0^^J, ~ Dir(r/, ■ ■ ■ , r/) 
= X^^fc, ~ Pois(ejfc) 

A:=l 

~ Gamma(rfc6jfc, 0.5/ (1 — 0.5)) 
Tk ~ Gamma(7o, 1/c), 70 ~ Gamma(eo, I//0) 

bjk ~ Bernoulli(7rA,.), tt^ ~ Beta(c//s:, c(l - l/K)). (70) 
The block Gibbs sampling can be expressed as 

Vv{zji = k\-) oc (pvj.kOjk 

b,, ~ 6{n,, = 0)Bernoulli (^ ,^,(, _ q 5)., ^ _ ) + ^{n,, > 0) 



/ v4 v4 \ - V.&,fcln(l - 0.5) 

(^ifc|-) ~ CKT{njk,rkbjk), {l'k\~) ~ CRT ^/jfc,7o 



(7o|-) ~ Gamma Uo + ^ Z^, 



1 



fcit /o - Ef=i ln(l - P'fe) 



(r.H ~ Gamma ( + g ^ _ ^jj^^^, _ , 



yjk\ 



Gamma(rfe6jfc + Uj^, 0.5) 



{(Pk\-) ~Dir(r/ + ni.fc,--- ,r/ + ny.fc). (71) 

E. Beta-Negative Binomial Process 

We consider a beta-NB process that the NB probability measure is shared and drawn from 



a beta process while the NB dispersion parameters are group dependent. As in Section II-A3 
a draw from the beta process B ~ BF(c, Bq) can be expressed as 5 = J^kLiPk^^k^ '^hus a 
beta-NB process can be constructed as Xj ~ NBP(rj, B), with a random draw expressed as 
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oo 

= X^'^jfc^o^fc, rijk ~ NB(rj,pfc)- (72) 

fc=i 

1) Posterior Analysis: Assume we already observe {Xj}ij and a set of discrete atoms V = 
{^k}i,K- Since the beta and NB distributions are conjugate, at an observed discrete atom cuk G V, 
with pk = B{uk) and n^k = Xj{uk), we have pk\{rj,Xj}ij ~ Beta (e/=i ^jfc) c + X;/=i ^i) • 
For the continuous part ^l\D, the Levy measure can be expressed as u(dpdu)\{rj,Xj}ij = 
cp^^ {1 — pY'^^1^^ ''^^^ dpBQ^duj) . Following the notation in [1681 . Il27]| . [fTTTl . we have the posterior 
of the beta process as 

B\{r,, ~ BP (c + Y.U ^^Efe^o + ^^rfe ^^=1 ^^'^'^-^ ' ^^^^ 

Placing a gamma prior Gamma(co, l/cio) on r^, we have 

ljuVj, Xj ~ CRT(njfc, Tj), rj\{ljk]k, B ~ Gamma [cq + ^f^^ /^-fc, • (74) 

Note that if are fixed as one, then the beta-NB process reduces to the beta-geometric process 
discussed in [llj, and if are empirically set to some other values, then the beta-NB process 
reduces to the one proposed in [13J. These simplifications are not considered in the paper, as 
they are often overly restrictive. 

With a discrete base measure Bq = J2k=i j<^4'k' beta-NB process topic model is con- 
structed as 

Xji^F{(f)^J, ~ Dir(r/, ■ ■ ■ , r/) 

K 

= X] '^jk, rijk ~ Pois(6'jfe), 9jk ~ Gamma{r j,pk/ (1 - Pk)) 

k=l 

r,- ~Gamma(eo,l//o), Pfc ~ Beta(c/ir, c(l - iT)) (75) 
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The block Gibbs sampling can be expressed as 

'Prizji = k\-) (X (pvj.kOjk 

{Pk I -) ~ Beta [c/K + ^ n.^, c(l - l/iT) + ^ r, ) , l^k ~ CRT(n,fc, r^] 

1 



J 



i=i j=i 

K 



rj\-) ~ Gamma eo + ^ Ijk 



k=i /o - Ef=i - Pfc) 



~ Gamma(rj + rijk.Pk) 
{4>k\-) ~ Dir(?7 + rii.fc, ■ ■ ■ ,77 + ny.fc) . (76) 

i^! Marked-Beta-Negative Binomial Process 

We may also consider a marked-beta-NB process that both the probability and dispersion 
measures are shared, in which each random point {cok,Pk) of the beta process is marked with an 
independent gamma random variable taking values in IR+. Using the marked Poisson process 
theorem [[SI, we may regard {R, B) = Yl'^=ii''"k,Pk)Sivk ^ random draw from a marked beta 
process defined in the product space [0, 1] x ]R_|_ x with Levy measure 

v{dpdrdu) = cp-\l - py-^dpRo{dr)BQ{duj) (77) 

where Rq is a continuous finite measure over ]R_|_. A marked-beta-NB process can be constructed 

by letting Xj ~ NBP(i?, B), with a random draw expressed as 

00 

Xj = ^ rijkSuj^, rijk ~ NB(rfc,pfc). (78) 

k=l 

1) Posterior Analysis: At an observed discrete atom cuk E V, with = R{ujk), we have 
Pk\R, {Xj}i^j ~ Beta (^j=injk,c + Jrk^ . For the continuous part Q\V, with r = R{u) for 
u G Vt\V, we have u{dpduj)\R, {Xj}i^j = cp'^{l — pY~^'^^~^dpBo(dco) . Thus the posterior of B 
can be expressed as 

B\R, {X,h,j ~ BP (c, fjBo + ^ Eii E/=i ^.fc^...) (79) 

where cj is the concentration function as cj(w) = c+JR{u)+J2j=iXj{^)-^^^ Ro{d''^)/Ro(M+) = 
Gamma(r; eo, 1/ fo)dr, then for G V, we have 

^jfcl-R, -^j ~ CRI{njk, rk), rk\{ljk}j=i,j, B ~ Gamma ^cq + Z]/=i ^jfc, /o_jin(i-pj^) 

(80) 
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and for uu e Q,\D, the posterior of r — R{uj) is the same as the prior r ~ Gamma(eo, l//o)- 

With a discrete base measure Bq = ^^j^ jc^4>k' Marked-Beta-NB process topic model is 
constructed as 



J 



(r.l-) Gamma eo + ^ I,,, 



Xji ~ F{(f)^..), (f)^ ~ Dir(?7, ■■■ ,r]) 

K 

^3 = "-jfe ~ Pois(6'jfe), Ojk ~ Gamma(rfe,pfe/(1 - Pfe)) 

k=l 

rjfc~Gamma(eo,l//o), Pfe - Beta(c/i^, c(l - iC)) (81) 
The block Gibbs sampling can be expressed as 

Pk Beta ^c/K + ^ rijk, c(l - + Jvk] , Ijk ~ CRT(njfc, Vk) 

1 

{Ojk\-) ~ Gamma(rjk + njfe,Pfe) 

i(t>k\-) ~Dir(r^ + ni.fc,-- - ,r; + ny.fc). (82) 

G. Marked-Gamma-Negative Binomial Process 

We may also consider a marked-gamma-NB process that each random point {rk,ojk) of the 
gamma process is marked with an independent beta random variable pk taking values in [0, 1]. 
We may regard {G,P) — YlV=i{'^k,Pk)^tjjk a random draw from a marked gamma process 
defined in the product space IR+ x [0, 1] x f2, with Levy measure 

u{drdpduj) = r-h-^'drPo{dp)Go{duj) (83) 

where Pq is a continuous finite measure over [0, 1]. 

1 ) Posterior Analysis: At an observed discrete atom ujk G V, we have 

ljk\G, Xj ~ CRT{njk, Vk), rk\{ljk}j=i,j, P ~ Gamma (e/=i Ijk, c-j in(i-pfc) ) (^4) 

where = G{uk) and = P{ujk)- For the continuous part with p = P{cu) for cu E ^lyv, 
the Levy measure of G can be expressed as i'{drdu!)\P, = r~^e~^'^~'^^'^^^~P^^^drGo{duj). 

Thus the posterior of G can be expressed as 
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G\P, {X^h,j ~ GaP (cj, Go + Ef=i E/=i IjkS.,) (85) 

where cj is the concentration function as cj(a;) — c — Jln(l — P{uj)). Let Po{dp) / Po{[0, 1]) = 
Beta(p; ao, bo)dp, then for e T>, we have 

{^i}i,J ~ Beta (^ao + rijk, bo + Jr^) (86) 

and for u e Q\D, the posterior of p = -P(^) is the same as the prior p ~ Beta(ao, bo). 

With a discrete base measure Go = J2k=i ^^4>k' Marked-Gamma-NB process topic model 
is constructed as 

Xji ~ F{(f>^..), </)fe ~ Dir(77, • • • , 77) 

K 

^ XI "'J'^ Pois(^jfe), ^'jifc ~ Gamma(rfe,pjk/(1 - Pk)) 

k=l 

Tfe ~ Gamma(7o/-fr, 1/c), ~ Beta(ao, 60), 7o ~ Gamma(eo, l//o). (87) 
The block Gibbs sampling can be expressed as 



/ , ^ \ , — Jln(l— pfc) 

Pk ~ Beta \^ao + 2^ 60 + Jrkj, Pk ^ c-Jln{l-pk) 

Ijk ~ CRT(njfc, rjfc), Zj^ ~ CRT(^ /^-fc, 70/ir), 70 ~ Gamma [eo + Y^ I'k, w-, TVJT; 

(rjk|-) ~ Gamma ^7o/-f^ + ^ ^jfe, ^^j^^^^-^^ j , {Ojk\-) ~ Gamma(rfe + njjk,pfe) 
(<Afe|-) ~Dir(ry + ni.fe,-- - ,r] + nv.k)- (88) 
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